Agriculture plays a key role in the overall economic development of the country. 
Climate change, irregular rainfall, changes in soil nutrient content, and other 
environmental changes pose severe problems for crop yield prediction. Deep 
learning (DL) models that incorporate multiple factors are an essential strategy 
for attaining accurate and effective solutions to this issue. Crop yield can be 
predicted from historical yield data that includes information about the weather, 
soil nutrient content, soil type, the season in which the crop was grown, and 
its yield. Training the model to a high accuracy requires a large dataset 
covering multiple factors. This research aims to forecast the yield of a given 
crop using long short-term memory (LSTM) time series analysis and the 
information currently available. The data used to construct the models was 
obtained from a reputable source and is considered accurate. Yield prediction 
with advanced methodologies, performed before a crop is sown on a piece of 
agricultural land, can help farmers estimate the yield of that crop. 


1. INTRODUCTION 

Crop yield is a complex trait determined by various factors, including the soil type, the physical 
environment, and the changes occurring in it. Predicting crop yield continuously requires a large amount of 
information that can be used to investigate the relation between the obtained crop yield and other parameters. 
Understanding these dependencies requires familiarity with extensive datasets as well as with the algorithms 
that might be applied to them [1]. Time is an essential parameter whenever a model has to forecast something, 
be it an expected stock price, the amount of crop yield, or the amount of rainfall at a particular location. 
For instance, a model that predicts the time of peak electricity consumption makes it possible to control 
consumption expenses: more electricity can be produced during the peak time, and resources can be saved when 
demand is low [2]. 

A time series is a sequence of data points arranged in order of time. In time series analysis, time is 
generally treated as the independent variable, and the main objective is forecasting: future outcomes are 
predicted on the basis of past data [3]. Seasonality is one aspect of time series analysis and refers to 
periodic fluctuation. For example, different types of crops grow in different seasons. Seasonality can be 
studied by examining whether the relation follows a sinusoidal shape over the complete duration of a season [4]. 

Stationarity is one of the most crucial aspects. A time series is called stationary only when all of its 
statistical attributes remain the same over time; in particular, the mean and the variance are constant, so 
the variance is independent of time. There are various ways to model a time series in order to make 
predictions. Some of them are: 

- Moving average: This is a general baseline approach for time series models. It states that the 
forthcoming value will be the mean of all the values that were observed in the past. 

- Exponential smoothing: This model uses the same logic as the moving average, with the difference that 
weights arranged in descending order are assigned to the respective observations. Therefore, less 
significance is given to observations the further they lie in the past. One variant is double exponential 
smoothing, which is also implemented in this time series model. It is used when the smoothing needs to be 
applied twice in succession. 

-  SARIMA: The seasonal autoregressive integrated moving average model combines several simple models to 
form one that is complex enough to be used for a wide range of time series. 
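
The first two approaches in the list above can be illustrated with a minimal sketch in plain Python. The function names, the smoothing factor alpha, and the sample values are assumptions made for illustration; double smoothing is implemented as the text describes it (the same smoothing applied twice), without the trend correction that some formulations add.

```python
def moving_average_forecast(values, window=None):
    """Forecast the next value as the mean of past observations.

    With window=None every past value is used, as in the text; a finite
    window restricts the mean to the most recent `window` values.
    """
    recent = values if window is None else values[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing(values, alpha):
    """Return the smoothed series; older observations get geometrically smaller weights."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def double_exponential_smoothing(values, alpha):
    """Apply the same smoothing twice in succession."""
    return exponential_smoothing(exponential_smoothing(values, alpha), alpha)


yields = [2.0, 4.0, 6.0, 8.0]  # made-up yearly yields, for illustration only
print(moving_average_forecast(yields))            # mean of all past values
print(moving_average_forecast(yields, window=2))  # mean of the last two values
print(exponential_smoothing(yields, alpha=0.5))
print(double_exponential_smoothing(yields, alpha=0.5))
```

A larger alpha makes the smoothed series follow recent observations more closely, while a smaller alpha gives more weight to the history.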

The proposed work is divided into two stages: 

-  Pre-processing: eliminating the null values and redundant data, and discarding the factors that are not 
relevant to the crop. 

-  Model: several deep learning (DL) models are applied to evaluate the performance of crop yield 
prediction. The proposed work enhances the long short-term memory (LSTM) method to predict the crop 
yield. 

Section 2 surveys yield prediction techniques. Section 3 describes the dataset and the deep learning 
models. Section 4 presents the proposed reconstruction strategy for LSTM. Section 5 presents the results and 
discussion of yield prediction. 


2. RELATED WORKS 

The work in [5] used remote sensing to gather data, together with moderate resolution imaging 
spectroradiometer (MODIS) satellite imagery. The pre-processing of remote sensing images [6], which includes 
radiometric calibration and geological corrections, was handled with the environment for visualizing images 
(ENVI) software, using the fast line-of-sight atmospheric analysis of spectral hypercubes (FLAASH) module [7]. 
Building a single-level model for each crop was observed to be the better option for keeping things simple [8]. 
Once all the crop models achieve a high accuracy in yield prediction, the simple models can be merged into 
more complex ones [9]. Calendar-based information [10] must also be taken into consideration as one of the 
parameters, to keep track of the sowing cycle and to obtain valuable information on land use and crop 
phenology. Collecting ground-based data for training is challenging and takes a lot of time. To overcome this, 
remote sensing satellites [11] started playing an essential role. Kussul et al. [12] made predictions on the 
basis of in-situ data from the previous year, tuned with neural networks, with an accuracy of 85.9%. In the 
same year, Zhong et al. [13] used the MODIS normalized difference vegetation index (NDVI) and LSTM [14] to 
predict vegetation dynamics with a root mean square error (RMSE) below 0.03. The dataset was divided into 
three segments: training, validation and testing. With this method, vegetation changes can be predicted well 
and precautions can be taken for the safety of the crops beforehand [15]. 

To make the model more dynamic, identification of the crops can also be added, which was a key feature 
of research carried out in 2019. Various machine learning methods, such as support vector machines, can be 
used to achieve this. Once the vegetation index (VI) is calculated for a particular soil, it can be used for 
prediction with high accuracy. Other parameters, such as the leaf area index (LAI) and LNA, can also be 
calculated for better and more accurate predictions. The soil-adjusted vegetation index (SAVI) was at one time 
considered accurate because it accounted for many parameters and conditions. The model was faster than 
earlier models and also gave high-accuracy output. The work in [16] added further variables to the existing 
conventional methods using the random forest regression algorithm, which gave even better accuracy. In the 
same year, Zhou [17] used a convolutional neural network (CNN) to design a model based on NDVI and 
red-green-blue (RGB) data for crop prediction from imagery obtained by an unmanned aerial vehicle (UAV). 
The CNNs performed far better than NDVI and RGB alone [18]. Optical sensor imaging is considered an efficient 
method for monitoring the various parameters needed to predict crop yield [19]. The data obtained from an 
optical sensor may sometimes be ineffective due to cloud cover, which can be overcome by using synthetic 
aperture radar (SAR) sensors. 

Trinks and Felden [20] tried to bridge the gap between optical data and SAR sensors using MCNN-Seq, 
an extension of the conventional CNN-recurrent neural network (RNN). This shows the efficiency of 
multi-layered CNNs over single-layered CNNs. The accuracy achieved by this model was higher than that of CNN 
or RNN in terms of R² and RMSE [21]. Apart from predicting the crop yield, protecting the crop from diseases 
is equally important. Wang [22] focused on crop disease diagnosis: a model named SqueezeNet was used with the 
PlantVillage dataset, and an accuracy of 98.49% was achieved [23]. Other work focused on predicting the 
seeding, maturation and harvest dates in order to maximize production. The Sentinel-1 satellite was used to 
extract the time series data [24]. SAR backscattering and interferometric synthetic aperture radar (InSAR) 
coherence provided the crop seeding dates with 85% accuracy and the harvest dates with 56% accuracy. Another 
study using time series data extracted from Sentinel-1 was done by [25], which analysed cross polarisation 
(VH/VV). This enhanced the prediction of the seeding dates, the maturation states of the currently growing 
crop, and the harvest dates. 

Over the years, the number of parameters used has increased, models have become more dependent on 
these parameters, and model accuracy has improved, because features such as climate change predictions and 
various other parameters can be used to make the model more efficient [26]. To the best of our knowledge, crop 
yield depends more on phenotype or environmental factors than on genotype factors. The results of the study 
[27] also reveal that environmental factors play a predominant role among the parameters affecting crop yield 
when compared with genotype factors [28]. A multipath delay commutator fast Fourier transform has been 
proposed for enhancing throughput [29]. Cooperative routing using the fresher encounter algorithm has been 
used to improve energy efficiency and to solve the dead-node issue [30]. 


3. DEEP LEARNING METHODS 
3.1. Dataset details 

Training the model and enhancing its efficacy requires a high-quality dataset in order to produce 
accurate results. Because the model's output is derived from the training dataset alone, the properties and 
the size of the dataset also play a crucial part in testing the model. The crop data should include numerous 
parameters that can be weighed during the feature selection procedure. 

The dataset under consideration contains data from several locations in India. It contains about 
eighty thousand records, collected between 2014 and 2019, and takes into consideration numerous climate and 
soil characteristics. When calculating a field's crop yield, the field's area is taken into account, along with 
the crop production in that field over the years. Rainfall is a significant factor in agricultural yield 
forecasting, since it is a primary climatic factor that can harm the crop in excess and destroy the yield when 
insufficient. The analysis takes into account a variety of seasons: each crop grows in a distinct season, which 
provides it with an appropriate growing environment. Temperature is also essential in relation to the type of 
crop that has been planted. Several crops, namely moong, wheat, maize, and urad, are used to analyse and train 
the model. Soil conditions are also critical; different types of soil vary in pH, nitrogen content, and 
electrical conductivity. These values are examined for each data point because crop health depends on them. 


3.2. Data pre-processing 

Data pre-processing converts raw text data into numerical values. It also eliminates missing values 
and redundant features from the dataset. This research work concerns yield prediction, yet some of the 
features in the dataset are represented as text; for example, district and season are text features. These 
categorical values are converted into numbers before the DL models are applied. The dataset also contains 
some missing values, which are eliminated using Python libraries. The dataset does not contain any redundant 
features. After pre-processing, the dataset is ready for the DL models to be applied, so that their accuracy 
can be estimated and the best one identified. 
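
The two pre-processing steps described above, removing records with missing values and converting text categories to numbers, can be sketched in plain Python. This is an illustration only: the paper does not name the library calls it uses, and the helper names, field names, and records below are hypothetical.

```python
def drop_missing(rows):
    """Keep only the records in which no field is None or an empty string."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]


def encode_categories(rows, column):
    """Replace text labels in `column` with integer codes (sorted label order)."""
    labels = sorted({r[column] for r in rows})
    codes = {label: i for i, label in enumerate(labels)}
    return [{**r, column: codes[r[column]]} for r in rows], codes


# Hypothetical records; the field names mirror the features described in the text.
records = [
    {"district": "A", "season": "Kharif", "rainfall": 30.4},
    {"district": "B", "season": "Rabi", "rainfall": None},  # missing value
    {"district": "A", "season": "Rabi", "rainfall": 3.4},
]

clean = drop_missing(records)
clean, season_codes = encode_categories(clean, "season")
clean, district_codes = encode_categories(clean, "district")
print(season_codes)  # mapping from season label to integer code
print(clean)
```

Integer codes impose an arbitrary order on the categories, which is one reason one-hot encoding is often preferred for features such as season.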


3.3. Deep neural network (DNN) 

Two deep neural network (DNN) models, namely recurrent neural networks (RNN) and long short-term 
memory (LSTM), are analyzed in the proposed work for the yield prediction of field crops. Nodes are the small 
components of the network and are analogous to the neurons in a human brain: as soon as a stimulus reaches 
them, a response takes place within these nodes. Many nodes can be related to one another and identified by 
a label. In general, the nodes are further grouped into layers. Figure 1 shows the layers of a DNN. 
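
The grouping of nodes into layers can be made concrete with a small forward pass. The sketch below is not the network used in the paper; the layer sizes, weights, and the choice of ReLU on the hidden layer are assumptions chosen to keep the arithmetic easy to follow.

```python
def relu(values):
    """Pass positive values through and clamp negative values to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each node sums its weighted inputs plus a bias."""
    return [sum(w * x for w, x in zip(node_w, inputs)) + b
            for node_w, b in zip(weights, biases)]


def forward(inputs, layers):
    """Pass the inputs through every layer in order; ReLU on hidden layers only."""
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if i < len(layers) - 1:
            activations = relu(activations)
    return activations


# Illustrative weights only: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
print(forward([2.0, 1.0], layers))
```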

To complete a task successfully, the network must pass information through the layers between the 
input and the output. Numerous layers are present, and the information follows a particular path through them 
to produce the final output. The deeper the network, the better it generally works, since more instances are 
taken into account. A DNN is helpful when there is a need to reduce human labour, and in many cases it tends 
to replace it. A DNN thus operates autonomously with little compromise in its effectiveness. Its usage across 
various sectors illustrates its varied applications in real-time scenarios. 

Figure 1. Representation of DNN Layers 


3.3.1. Recurrent neural network (RNN) 

The RNN model was created to take into consideration the temporal dependence of agricultural output 
across a number of years. Two contrasting observations motivated the deployment of the RNN model. On the one 
hand, the yields of field crops (maize, wheat, urad, and moong) have increased over the last four decades. 
This is partly attributable to the continued improvement of genetics and management practices as a result of 
significant investments in breeding and agricultural techniques. On the other hand, genetic information was 
not publicly available for this prediction study. Consequently, the effect of genotype must be addressed 
indirectly in the model, using the available data. This paper applies the RNN technique to predict the yield 
of different crops based on environmental factors. The performance of the RNN is discussed in the results 
and discussion section. 
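
The temporal dependence that an RNN captures comes from feeding each hidden state back into the next step. A minimal single-unit sketch is given below, assuming the common tanh recurrence; the weights and the input series are illustrative values, not parameters from the paper.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a single-unit recurrent layer over a sequence.

    h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), so each hidden state depends
    on the current input and on everything seen before it.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Illustrative weights and a made-up normalised yield series.
states = rnn_forward([0.2, 0.4, 0.6], w_x=1.0, w_h=0.5, b=0.0)
print(states)
```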


3.3.2. Long short-term memory (LSTM) 

LSTM is a recurrent artificial neural network (ANN) architecture used in the field of deep learning. 
Sequence problems are considered among the hardest problems in deep learning. They include a variety of 
problems such as stock prediction, sales prediction, and crop prediction. LSTM has an edge over CNN, RNN and 
feed-forward neural networks in many ways. LSTM modifies information through multiplications and additions, 
as shown in Figure 2. The mechanism through which the information flows is called the cell state. 

Figure 2. Working principle of LSTM 

In Figure 2, orange rectangles represent layers, arrows represent the copy operation, and circles 
represent pointwise operations. $X_t$ is the input vector, $H_{t-1}$ is the output of the previous cell, 
$C_{t-1}$ is the memory of the previous cell, $H_t$ is the output of the current cell, $C_t$ is the memory of 
the current cell, * denotes multiplication, + denotes addition, and $W$, $U$ are weights. 

The inputs to an LSTM cell can be classified into three states: 

- Previous cell state: holds the information present in the memory after the previous step. 
- Previous hidden state: holds the output of the previous cell. 
- Input state: carries the new information. 
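
How the three states combine can be sketched as a single LSTM step. The paper does not spell out the gate equations, so the sketch assumes the standard formulation with forget, input, and output gates; the scalar (single-unit) weights and the dictionary layout for W, U, and b are illustrative choices.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1); used for the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a single-unit LSTM cell.

    W, U and b are dicts keyed by gate name: 'f' (forget), 'i' (input),
    'o' (output) and 'g' (candidate memory).
    """
    f = sigmoid(W["f"] * x_t + U["f"] * h_prev + b["f"])    # how much old memory to keep
    i = sigmoid(W["i"] * x_t + U["i"] * h_prev + b["i"])    # how much new information to admit
    o = sigmoid(W["o"] * x_t + U["o"] * h_prev + b["o"])    # how much memory to expose
    g = math.tanh(W["g"] * x_t + U["g"] * h_prev + b["g"])  # candidate new information
    c_t = f * c_prev + i * g   # multiplication and addition on the cell state
    h_t = o * math.tanh(c_t)   # output of the current cell
    return h_t, c_t


# Illustrative parameters: all weights and biases set to zero.
zeros = {k: 0.0 for k in "fiog"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, W=zeros, U=zeros, b=zeros)
print(h_t, c_t)
```

With all parameters at zero, every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one.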

Figure 3 represents the LSTM states. Although LSTM gives higher accuracy on weekly data and is more 
reliable than the other models, the complexity of the code also increases. A larger amount of input data is 
also needed to train an LSTM. A DNN is therefore used to reduce the complexity and the amount of data needed 
to train the model; it also outperformed LSTM on daily data. 

Figure 3. Representation of LSTM states 


4. PROPOSED METHOD 

The proposed method makes use of a reconstructive mechanism for LSTM. The proposed architecture is 
shown in Figure 4. Of all the models analyzed, LSTM gave high accuracy in various other studies when a huge 
dataset was taken into account. Using LSTM to build a model that works on large time series data may 
therefore give good outputs with better accuracy. The time series dataset has multiple parameters that should 
be taken into account, such as rainfall, area, the season in which the crop was sown, the pH value of the 
soil, and the nitrogen content of the soil. 
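
An LSTM consumes sequences, so a multi-parameter time series is usually reshaped into fixed-length windows before training. The paper does not describe its windowing, so the sketch below is an assumption: the helper name, the window length of 2, and the feature tuples (values borrowed from Table 1 purely as placeholders, not a real series for one field) are all illustrative.

```python
def make_windows(series, targets, window):
    """Pair each run of `window` consecutive feature rows with the next target."""
    samples = []
    for start in range(len(series) - window):
        x = series[start:start + window]  # features for steps start .. start+window-1
        y = targets[start + window]       # value to predict at the following step
        samples.append((x, y))
    return samples


# Placeholder rows of (rainfall, pH, nitrogen) and matching production values.
features = [(30.4, 6.5, 497), (111.9, 5.6, 473), (3.4, 7.3, 366), (30.9, 5.3, 417)]
production = [3200.0, 75572.0, 49099.0, 3945.0]

samples = make_windows(features, production, window=2)
for x, y in samples:
    print(x, "->", y)
```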

The methods currently in use have many flaws, be it the lack of accurate data or inadequate 
processing of all the essential parameters needed to forecast the yield. Also, the usage of the current 
methodologies is confined to some well-developed areas, where farmers with a good source of income can afford 
to benefit from the models. 

Figure 4. Proposed architecture of crop yield prediction 


The data pre-processing phase should also be reconstructed. The dataset to be used in the training 
process should be filtered so that it contains no null values. Apart from the null values, care must be taken 
that all the values in a column have the same datatype. It is also necessary to identify the important 
attributes to be considered in order to train the model accurately. 

Once all the pre-processing is done, the model can be trained on the key attributes, from which the 
yield or the total production of the crop can be forecasted. Since the dataset is huge, overfitting is 
possible, which might in turn affect the accuracy of the model, so some mechanism should be added to avoid 
it. Once this is done, the model can be tested on the test dataset. 
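
One common mechanism for limiting overfitting is dropout, which the results section reports using with rates of 0.2 and 0.3. The sketch below shows the usual "inverted" form of dropout in plain Python; it is an illustration of the technique under that assumption, not the paper's implementation, and the function signature is hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`.

    Surviving activations are scaled by 1 / (1 - rate) so that the expected
    value is unchanged; at inference time the values pass through as they are.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10
print(dropout(hidden, rate=0.2, rng=rng))                  # training: some values zeroed
print(dropout(hidden, rate=0.2, rng=rng, training=False))  # inference: unchanged
```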

The values of the test dataset are already known, but it is used to check the accuracy of the model; 
if the accuracy is low, the necessary corrective actions can be taken. Once a good accuracy is attained, the 
model can be used on a dataset holding the essential values for the crops that will be grown in the future. 
Based on this data, the model should predict the yield of the crop in a particular season. 


5. RESULTS AND DISCUSSIONS 

The LSTM model can add information to, or remove it from, the cell state. This is regulated by gates, 
which allow the model to train more efficiently. The model takes in the training dataset and processes all 
the information. Once training is complete, the model is tested on data for which the yield is known, so that 
the predicted yield can be cross-checked. 

While developing the model to forecast the yield, the following points must be taken into 
consideration: 

- Input from the user should have the same parameters that were used while training the model. 

- The layers should be used properly in order to get correct outputs and avoid overfitting. 

- Area, season, crop sown and the nutritional values present in the soil should be given importance. 

Figure 5 shows the season and production graph. 

First, for analysis and a better understanding of the data, a graphical representation of each crop 
with its production under specific conditions is plotted. Many parameters in the crop dataset might play a 
vital role in predicting the yield for a particular season. Since the yield of the crop depends on many 
parameters, the data must be processed first so that it can be easily used for further analysis. 

Figure 6 shows a season versus temperature graph. In the same way, production graphs are plotted 
for the other features, such as temperature, season and rainfall. The data is then checked for null values. 
The isna() check gives the count of null values in the training dataset for each individual column. If any 
null value is found, it has to be omitted or processed correctly so that it does not affect the prediction. 
The same is done with the test data. 

Figure 7 shows a season-rainfall graph. The production column of the training and testing datasets is 
stored in separate variables and then dropped from the original datasets, so that the datasets become 
independent of the production values. Then, using a one-hot encoder and a column transformer, the data is 
fitted into an array that holds the values of every row. The same is repeated for the test dataset. Table 1 
describes the training data, whereas Table 2 illustrates the test data. Tables 1 and 2 show the features used 
to predict the crop yield. The output to be predicted is the crop production, which is not present in the 
test data. 
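
The separation of the production column and the one-hot encoding step can be sketched in plain Python. This is a stand-in for the one-hot encoder and column transformer mentioned above, not their actual API; the helper names are hypothetical and the two rows are abridged from Table 1.

```python
def split_target(rows, target):
    """Return (features, targets): `target` is removed from every feature row."""
    features = [{k: v for k, v in r.items() if k != target} for r in rows]
    targets = [r[target] for r in rows]
    return features, targets


def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded


# Two rows abridged from Table 1.
train = [
    {"Area": 7800.0, "Production": 3200.0, "Season": "Kharif", "Crop": "Moong"},
    {"Area": 44656.0, "Production": 49099.0, "Season": "Rabi", "Crop": "Wheat"},
]
x_train, y_train = split_target(train, "Production")
x_train = one_hot(one_hot(x_train, "Season"), "Crop")
print(y_train)
print(x_train[0])
```

In practice the categories should be learned from the training data and reused for the test data, so that both end up with the same indicator columns.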

Table 1. Training data 
Area      Production  Rainfall    Season  Temperature  Crop   pH   Nitrogen (kg/ha)  Electrical conductivity (dS/m)  Year 
7800.00   3200.0      30.400333   Kharif  28.007000    Moong  6.5  497               4.1                             2014 
39922.0   75572.0     111.90100   Kharif  27.232333    Maize  5.6  473               3.9                             2014 
44656.0   49099.0     3.396500    Rabi    20.277000    Wheat  7.3  366               4.9                             2014 
6540.0    3945.0      30.932500   Rabi    24.241500    Wheat  5.3  417               3.8                             2014 
2911.0    2062.0      189.208333  Kharif  27.456333    Maize  6.3  267               3.5                             2014 

Table 2. Test data 
Area   District   Season  Rainfall  Temperature  Crop       pH   Nitrogen (kg/ha)  Electrical conductivity (dS/m)  Year 
10803  Bairampur  Kharif  0.215     0.214        Urad       7.0  267               2.9                             2020 
84190  Bairampur  Kharif  1.428     1.458        Urad       WS)  320               5.6                             2020 
43539  Bairampur  Kharif  6.952     2.145        Sugarcane  6.7  305               4.0                             2020 
90246  Bairampur  Winter  2.152     6.952        Wheat      5.7  279               2.1                             2020 
18087  Bairampur  Whole   8.546     1.976        Wheat      6.4  252               2.5                             2020 


Figure 8 is plotted for the multiple crops that were grown in all six seasons, i.e. summer, winter, 
autumn, kharif, rabi and whole year. The x-axis holds the season and the production value is marked on the 
y-axis. Each colour represents a specific crop. To train the model, we used four dense layers with the 'relu' 
activation function, and dropout of 0.2 and 0.3 was added to avoid overfitting. Figure 9 was obtained by 
plotting the values from the training and testing datasets. The equation y = a + bx was taken into 
consideration, wherein x contains the data from the training dataset, y contains the data from the test 
dataset, b is the slope of the line and a is the y-intercept. Figure 9 shows the linear regression graph of 
x_train and y_train. 
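
The line y = a + bx can be fitted by ordinary least squares. The sketch below shows the closed-form solution for the intercept a and the slope b; the function name and the sample points are illustrative, and the paper does not state which fitting routine it used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


# Made-up points lying exactly on y = 1 + 2x.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```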

Figure 8. Season wise crop-production bar graph 

Table 3 reports the accuracy and the loss percentage on the test data for the RNN and LSTM neural 
networks. The models are trained and tested for 10 epochs. Beyond 10 epochs the accuracy does not increase 
much: when training is extended to 20 epochs, only a minimal gain in accuracy is observed. Table 3 compares 
the accuracy at each epoch for the RNN and LSTM networks. Our findings show that both LSTM and RNN achieve 
satisfactory results on this dataset, but LSTM outperforms RNN. 

Figure 9. Linear regression of x_train and y_train 


Table 3. Accuracy in test data of each epoch 
Epoch  RNN accuracy  RNN loss  LSTM accuracy  LSTM loss 
1      0.81          19.12     0.93           7.01 
2      0.74          26.31     0.86           14.01 
3      0.76          24.14     0.88           12.21 
4      0.82          18.13     0.91           9.12 
5      0.80          20.00     0.92           8.00 
6      0.80          20.00     0.92           8.00 
7      0.80          20.00     0.92           8.00 
8      0.75          25.17     0.87           13.21 
9      0.77          23.14     0.89           11.02 
10     0.81          19.13     0.93           7.01 
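
The comparison in Table 3 can be summarised with a few lines of Python. The accuracy values are copied from the table; the helper name and the choice of summary statistics (best epoch, mean, per-epoch gap) are illustrative.

```python
# Accuracy values copied from Table 3 (epochs 1 to 10).
rnn_acc = [0.81, 0.74, 0.76, 0.82, 0.80, 0.80, 0.80, 0.75, 0.77, 0.81]
lstm_acc = [0.93, 0.86, 0.88, 0.91, 0.92, 0.92, 0.92, 0.87, 0.89, 0.93]


def summarise(acc):
    """Return (best accuracy, first epoch reaching it, mean accuracy)."""
    best = max(acc)
    return best, acc.index(best) + 1, sum(acc) / len(acc)


rnn_best, rnn_epoch, rnn_mean = summarise(rnn_acc)
lstm_best, lstm_epoch, lstm_mean = summarise(lstm_acc)
gaps = [lstm - rnn for lstm, rnn in zip(lstm_acc, rnn_acc)]

print(f"RNN : best {rnn_best:.2f} at epoch {rnn_epoch}, mean {rnn_mean:.3f}")
print(f"LSTM: best {lstm_best:.2f} at epoch {lstm_epoch}, mean {lstm_mean:.3f}")
print(f"LSTM ahead at every epoch: {all(g > 0 for g in gaps)}")
```

On these values LSTM is ahead of RNN at every epoch, with a best accuracy of 0.93 against 0.82.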


6. CONCLUSION 

The study shows that a variety of features are used in the models proposed in the publications 
selected for review. Each paper focused on the yield prediction of various crops using techniques such as RNN 
and LSTM. The study found that the datasets used in these publications differ in size as well as in 
geographical location, which limits the features available to the model. Across the algorithms used in this 
study, the accuracy of the LSTM algorithm in predicting the yield of field crops was remarkable when compared 
with that of RNN, reaching around 93%. Future work will focus on ensembling techniques to minimize the error 
on this dataset. 